We evaluate IBM's Enhanced Cell Broadband Engine (BE) as a possible building block of a new
generation of lattice QCD machines. The Enhanced Cell BE will provide full support of
double-precision floating-point arithmetic, including IEEE-compliant rounding. We have developed a
performance model and applied it to relevant lattice QCD kernels. The performance estimates are
supported by micro- and application-benchmarks that have been obtained on currently available
Cell BE-based computers, such as IBM QS20 blades and PlayStation 3. The results are encouraging
and show that this processor is an interesting option for lattice QCD applications. For a
massively parallel machine on the basis of the Cell BE, an application-optimized network needs
to be developed.
1. Introduction



The initial target platform of the Cell BE was the PlayStation 3, but the processor is currently
also under investigation for scientific purposes. It delivers extremely high floating-point
(FP) performance, memory and I/O bandwidths at an outstanding price-performance ratio and low
power consumption.

We have investigated the Cell BE as a potential compute node of a next-generation lattice 
QCD machine. Although the double precision (DP) performance of the current version of the Cell 
BE is rather poor, the announced Enhanced Cell BE version (2008) will have a DP performance of 
~ 100 GFlop/s and also implement IEEE-compliant rounding. We have developed a performance 
model of a relevant lattice QCD kernel on the Enhanced Cell BE and investigated several possible 
data layouts. The applicability of our model is supported by a variety of benchmarks performed 
on commercially available platforms. We also discuss requirements for a network coprocessor that 
would enable scalable parallel computing using the Cell BE. 



2. The Cell Broadband Engine 

An introduction to the processor can be found in Ref. [^], and a schematic diagram is shown
in Fig. 1. The architecture is described in detail in Ref. [Q], and we only give a brief overview here.

The Cell BE comprises one PowerPC Processor Element (PPE) and 8 Synergistic Processor
Elements (SPE). In the following we will assume that performance-critical kernels are executed on
the SPEs and that the PPE will execute control threads. Therefore, we only consider the
performance of the SPEs. Each of the dual-issue, in-order SPEs runs a single thread and has a dedicated
256 kB on-chip memory (local store, LS), which is accessible by direct memory access (DMA)
or by local load/store operations to/from the 128 general-purpose 128-bit registers. An SPE can
execute two instructions per cycle, performing up to 8 single-precision (SP) operations. Thus, the
aggregate SP peak performance of all 8 SPEs on a single Cell BE is 204.8 GFlop/s at 3.2 GHz.^1
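
The peak figure quoted above follows directly from the numbers in the text. As a minimal sketch (the function and constant names are ours, chosen for illustration), the arithmetic is:

```python
# Peak single-precision throughput of one Cell BE, using the figures in the text.
CLOCK_GHZ = 3.2          # clock frequency assumed in the text
N_SPE = 8                # SPEs per Cell BE
SP_OPS_PER_CYCLE = 8     # SP operations per SPE and cycle

def peak_gflops(n_units, ops_per_cycle, clock_ghz):
    """Aggregate peak in GFlop/s = units x operations per cycle x clock in GHz."""
    return n_units * ops_per_cycle * clock_ghz

sp_peak = peak_gflops(N_SPE, SP_OPS_PER_CYCLE, CLOCK_GHZ)
print(f"SP peak: {sp_peak:.1f} GFlop/s")
```

The same function gives the per-SPE share (25.6 GFlop/s) when called with `n_units=1`.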
Figure 1: Main functional units of the Cell BE (see Ref. [^] for details). Bandwidth values are given for a
3.2 GHz system clock.



^1 Available systems use clock frequencies of 2.8 or 3.2 GHz. In our estimates we assume 3.2 GHz.
Figure 2: Data-flow paths and associated execution times $T_i$. For simplicity, only a single SPE is shown.



The current version of the Cell BE has an on-chip memory controller supporting dual-channel
access to the Rambus XDR main memory (MM), which will be replaced by DDR2 for the Enhanced
Cell BE. The configurable I/O interface supports a coherent as well as a non-coherent protocol on
the Rambus FlexIO channels.^2 Internally, all units of the Cell BE are connected to the coherent
element interconnect bus (EIB) by DMA controllers.

3. Performance model 

To theoretically investigate the performance of the Cell BE, we use a refined performance
model along the lines of Refs. [Q, ^]. Our abstract model of the hardware architecture considers
two classes of devices: (i) Storage devices: These store data and/or instructions (e.g., registers or
LS) and are characterized by their storage size. (ii) Processing devices: These act on data (e.g.,
FP units) or transfer data/instructions from one storage device to another (e.g., DMA controllers,
buses, etc.) and are characterized by their bandwidths $\beta_i$ and startup latencies $\lambda_i$.

An application algorithm, implemented on a specific machine, can be broken down into different
computational micro-tasks which are performed by the processing devices of the machine
model described above. The execution time $T_i$ of each task $i$ is estimated by a linear ansatz

$$T_i \simeq I_i/\beta_i + O(\lambda_i), \qquad (3.1)$$

where $I_i$ quantifies the information exchange, i.e., the processed data in bytes.

Assuming that all tasks are running concurrently at maximal throughput and that all
dependencies (and latencies) are hidden by suitable scheduling, the total execution time is

$$T_{\rm exe} \simeq \max_i T_i. \qquad (3.2)$$

We denote by $T_{\rm peak}$ the minimal compute time for the FP operations of an application that could
be achieved with an ideal implementation (i.e., saturating the peak FP throughput of the machine,
assuming also perfect matching between its instruction set architecture and the computation). The
floating-point efficiency $\varepsilon_{\rm FP}$ for a given application is then defined as
$\varepsilon_{\rm FP} = T_{\rm peak}/T_{\rm exe}$.
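
The model of Eqs. (3.1) and (3.2) is simple enough to write down in a few lines. The following is a minimal sketch, not the authors' code: the class and function names are ours, latencies are neglected, and the task sizes and bandwidths in the example are hypothetical numbers chosen only to show the mechanics.

```python
from dataclasses import dataclass

@dataclass
class MicroTask:
    name: str
    data: float        # I_i: amount of processed data (bytes, or flops for the FP unit)
    bandwidth: float   # beta_i: same unit as `data`, per cycle

    def time(self) -> float:
        """Leading term of Eq. (3.1); the O(lambda_i) latency term is neglected."""
        return self.data / self.bandwidth

def execution_time(tasks):
    """Eq. (3.2): tasks overlap perfectly, so the slowest one sets the total time."""
    return max(task.time() for task in tasks)

def fp_efficiency(t_peak, tasks):
    """epsilon_FP = T_peak / T_exe."""
    return t_peak / execution_time(tasks)

# Hypothetical numbers, for illustration only.
tasks = [
    MicroTask("FP", data=1000.0, bandwidth=4.0),    # 250 cycles
    MicroTask("mem", data=4000.0, bandwidth=8.0),   # 500 cycles
]
t_peak = 250.0
print(execution_time(tasks), fp_efficiency(t_peak, tasks))
```

In this toy case the memory task is the bottleneck, so the efficiency is one half even though the FP units could finish in 250 cycles.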

In our analysis, we have estimated the execution times $T_i$ for data processing and transport
along all data paths indicated in Fig. 2, in particular:

^2 In- and outbound bandwidths will be symmetric on the Enhanced Cell BE, namely 25.6 GB/s each.
• floating-point operations, $T_{\rm FP}$
• load/store operations between register file (RF) and LS, $T_{\rm RF}$
• off-chip memory access, $T_{\rm mem}$
• internal communications between SPEs on the same Cell BE, $T_{\rm int}$
• external communications between different Cell BEs, $T_{\rm ext}$
• transfers via the EIB (memory access, internal and external communications), $T_{\rm EIB}$

Unless stated otherwise, all hardware parameters $\beta_i$ are taken from the Cell BE manuals [Q].

4. Linear algebra kernels 

As a simple application of our performance model and to verify our methodology, we analyzed
various linear algebra computations. As an example, we discuss here only a caxpy operation:
$c\,\psi + \psi'$ with complex $c$ and complex spin-color vectors $\psi$ and $\psi'$. If the vectors are stored
in main memory (MM), the memory bandwidth dominates the execution time, $T_{\rm exe} \approx T_{\rm mem}$, and
limits the FP performance of the caxpy kernel to $\varepsilon_{\rm FP} < 4.1\%$. On the other hand, if the vectors
are held in the LS, arithmetic operations and LS access are almost balanced ($T_{\rm peak}/T_{\rm LS} = 2/3$). In
this case, a more precise estimate of $T_{\rm FP}$ also takes into account constraints from the instruction set
architecture of the Cell BE for complex arithmetic and yields a theoretical limit of $\varepsilon_{\rm FP} < 50\%$.
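
The two caxpy regimes can be illustrated with a back-of-the-envelope calculation. The sketch below is ours and rests on assumptions not spelled out in this section: a DP peak of 4 flops per cycle and SPE (consistent with 1320 flops taking 330 cycles later in the text), an LS bandwidth of 16 bytes per cycle and SPE, a main-memory bandwidth of 25.6 GB/s shared by all SPEs, and a traffic count of two vector loads plus one store per site.

```python
# caxpy on one spin-color vector (12 complex numbers) in double precision.
BYTES_PER_VECTOR = 24 * 8            # 24 real DP words
FLOPS_PER_SITE = 12 * 8              # 12 complex multiply-adds, 8 flops each
TRAFFIC_BYTES = 3 * BYTES_PER_VECTOR # assumption: load psi and psi', store the result

N_SPE = 8
PEAK_FLOPS_PER_CYCLE = 4.0           # assumed DP peak per SPE
LS_BYTES_PER_CYCLE = 16.0            # assumed LS <-> register bandwidth per SPE
MM_BYTES_PER_CYCLE = 25.6 / 3.2      # assumed 25.6 GB/s at 3.2 GHz, per Cell BE

# Data in the local store: compare compute time and LS access time on one SPE.
t_peak_spe = FLOPS_PER_SITE / PEAK_FLOPS_PER_CYCLE
t_ls = TRAFFIC_BYTES / LS_BYTES_PER_CYCLE
ratio_ls = t_peak_spe / t_ls

# Data in main memory: all 8 SPEs share the memory interface.
t_peak_chip = FLOPS_PER_SITE / (N_SPE * PEAK_FLOPS_PER_CYCLE)
t_mem = TRAFFIC_BYTES / MM_BYTES_PER_CYCLE
eff_mm = t_peak_chip / t_mem

print(f"T_peak/T_LS = {ratio_ls:.3f}, efficiency from MM = {eff_mm:.1%}")
```

Under these assumptions the LS ratio comes out as 2/3 and the main-memory efficiency as roughly 4%, the same size as the bound quoted above. The tighter 50% limit for data in the LS depends on instruction-level details that this sketch does not model.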

We have verified the predictions of our theoretical model by benchmarks on several hardware
systems (Sony PlayStation 3, IBM QS20 Blade Server and Mercury Cell Accelerator Board). In
both cases (data in MM and LS) the theoretical time estimates are well reproduced by the
measurements. Careful optimization of arithmetic operations^3 is required only in the case in which all data
are kept in the LS (or, in general, if $T_{\rm exe} \approx T_{\rm FP}$).



5. Lattice QCD kernel 

The Wilson-Dirac operator is the kernel most relevant for the performance of lattice QCD
codes. We considered the computation of the 4-d hopping term

$$\psi'_x = \sum_{\mu=1}^{4} \left\{ U_{x,\mu}\,(1+\gamma_\mu)\,\psi_{x+\hat\mu} + U^\dagger_{x-\hat\mu,\mu}\,(1-\gamma_\mu)\,\psi_{x-\hat\mu} \right\}, \qquad (5.1)$$

where $x = (x_1, x_2, x_3, x_4)$ is a 4-tuple of space-time coordinates labeling the lattice sites, $\psi'_x$ and
$\psi_x$ are complex spin-color vectors assigned to the lattice site $x$, and $U_{x,\mu}$ is an SU(3) color matrix
assigned to the link from site $x$ in direction $\hat\mu$.



The computation of Eq. (5.1) on a single lattice site amounts to 1320 floating-point
operations.^4 On the Enhanced Cell BE this yields $T_{\rm peak} = 330$ cycles per site (in DP). However, the
implementation of Eq. (5.1) requires at least 840 multiply-add operations and $T_{\rm FP} \ge 420$ cycles
per lattice site to execute. Thus, any implementation of Eq. (5.1) cannot exceed 78% of the peak
performance of the Cell BE.
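
The 78% ceiling is a direct consequence of the operation counts. As a short worked sketch (variable names are ours; the only assumption beyond the text is that one multiply-add counts as two flops, so that a peak of 4 flops per cycle corresponds to 2 multiply-adds per cycle):

```python
FLOPS_PER_SITE = 1320      # from the text
FMA_PER_SITE = 840         # minimum number of multiply-add operations, from the text
FLOPS_PER_CYCLE = 4        # DP peak per SPE implied by 1320 flops <-> 330 cycles
FMA_PER_CYCLE = FLOPS_PER_CYCLE // 2   # assumption: one multiply-add = 2 flops

t_peak = FLOPS_PER_SITE / FLOPS_PER_CYCLE   # ideal cycles per site
t_fp = FMA_PER_SITE / FMA_PER_CYCLE         # cycles actually needed by the FP pipeline
bound = t_peak / t_fp

print(f"T_peak = {t_peak:.0f}, T_FP >= {t_fp:.0f}, efficiency bound = {bound:.1%}")
```

The gap arises because not every multiply-add slot does two useful flops: 840 multiply-adds could deliver 1680 flops, but only 1320 are needed.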



^3 We implemented our benchmarks of arithmetic operations in single precision. However, the theoretical analysis
presented here refers to double precision on the Enhanced Cell BE.

^4 We do not include sign flips and complex conjugation in the FLOP counting.
The time spent on possible remote communications and on load/store operations for the
operands ($9 \times 12 + 8 \times 9$ complex numbers) of the hopping term (5.1) strongly depends on the details
of the lattice data layout. We assign to each Cell BE a local lattice with
$V_{\rm Cell} = L_1 \times L_2 \times L_3 \times L_4$ sites, and the 8 SPEs are logically arranged as
$s_1 \times s_2 \times s_3 \times s_4 = 8$. Thus, each single SPE holds a subvolume of
$V_{\rm SPE} = (L_1/s_1) \times (L_2/s_2) \times (L_3/s_3) \times (L_4/s_4) = V_{\rm Cell}/8$ sites. Each SPE on average
has $A_{\rm int}$ neighboring sites on other SPEs within and $A_{\rm ext}$ neighboring sites outside a Cell BE.

We consider a communication network with the topology of a 3-d torus. We assume that
the 6 inbound and the 6 outbound links can simultaneously transfer data, each at a bandwidth of
$\beta_{\rm link} = 1$ GB/s, and that a bidirectional bandwidth of $\beta_{\rm ext} = 6$ GB/s is available between each Cell
BE and the network. This could be realized by attaching an efficient network controller via the
FlexIO interface. We have investigated different strategies for the lattice and data layout: Either all
data are kept in the on-chip local store of the SPEs, or the data reside in off-chip main memory.

Data in on-chip memory (LS) 

We require that all data for a compute task can be kept in the LS of the SPEs. Since loading
of all data into the LS at startup is time-consuming, the compute task should comprise a sizable
fraction of the application code. In QCD this can be achieved, e.g., by implementing an entire
iterative solver with repeated computation of Eq. (5.1). Apart from data, the LS must also hold a
minimal program kernel, the run-time environment, and intermediate results. Therefore, the storage
requirements strongly constrain the local lattice volumes $V_{\rm SPE}$ and $V_{\rm Cell}$.

The storage requirement of a spinor field is 24 real words (192 Byte in double precision)
per site, while a gauge field $U_{x,\mu}$ needs 18 words (144 Byte) per link. Assuming that for a solver
we need storage corresponding to 8 spinors and $3 \times 4$ links per site, the subvolume carried by a
single SPE cannot be larger than about $V_{\rm SPE} = 79$ lattice sites. Moreover, one lattice dimension,
say the 4-direction, must be distributed locally within the same Cell BE across the SPEs (logically
arranged as a $1^3 \times 8$ grid). Then, $L_4$ corresponds to a global lattice extension and, as a pessimistic
assumption, may be as large as $L_4 = 64$. This yields a very asymmetric local lattice^5 with
$V_{\rm Cell} = 2^3 \times 64$ and $V_{\rm SPE} = 2^3 \times 8$.
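
The storage constraint can be checked with a few lines of arithmetic. This is a sketch under our own simplification that the whole 256 kB local store (taken as 256 x 1024 bytes) is available for field data; in practice part of it is needed for code and intermediate results, as stated above.

```python
WORD_BYTES = 8                       # one real DP word
SPINOR_BYTES = 24 * WORD_BYTES       # 192 bytes per site
LINK_BYTES = 18 * WORD_BYTES         # 144 bytes per link
LS_BYTES = 256 * 1024                # assumption: 256 kB = 262144 bytes

# Solver working set per site: 8 spinors and 3 x 4 links (from the text).
bytes_per_site = 8 * SPINOR_BYTES + 3 * 4 * LINK_BYTES

# Upper bound if the entire LS could be used for field data.
max_sites_full_ls = LS_BYTES // bytes_per_site

# Subvolume chosen in the text: 2^3 x 8 sites per SPE.
v_spe = 2**3 * 8
used_bytes = v_spe * bytes_per_site
free_bytes = LS_BYTES - used_bytes

print(bytes_per_site, max_sites_full_ls, v_spe, used_bytes, free_bytes)
```

With these numbers a site costs 3264 bytes, which gives an upper bound of the same size as the limit of about 79 sites quoted above. The chosen subvolume of 64 sites stays below it and leaves roughly 50 kB of the LS for program code and buffers.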

Data in off-chip main memory (MM) 

When all data are stored in MM, there are no a-priori restrictions on $V_{\rm Cell}$. On the other
hand, we need to minimize redundant memory accesses to reload the operands of Eq. (5.1) into
the LS when sweeping through the lattice. To also allow for concurrent FP computation and data
transfers (to/from MM or remote SPEs), we consider a multiple buffering scheme.^6 A possible
implementation of such a scheme is to compute the hopping term (5.1) on a 3-d slice of the local
lattice and then move the slice along the 4-direction. Each SPE stores all sites along the 4-direction,
and the SPEs are logically arranged as a $2^3 \times 1$ grid to minimize internal and to balance external
communications between SPEs. If the $U$- and $\psi$-fields associated with all sites of three 3-d slices
can be kept in the LS at the same time, all operands in Eq. (5.1) are available in the LS. This
optimization requirement again constrains the local lattice size, now to $V_{\rm Cell} \approx 800 \times L_4$ sites.
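
The benefit of overlapping transfers and computation can be seen in a simple timing model. The sketch below is a generic illustration of double buffering, not the scheme implemented by the authors: it assumes equal-sized chunks (for example 3-d slices), ignores the store-back of results, and uses hypothetical load and compute times.

```python
def serial_time(n_chunks, t_load, t_compute):
    """Single buffer: every chunk is first loaded, then processed."""
    return n_chunks * (t_load + t_compute)

def double_buffered_time(n_chunks, t_load, t_compute):
    """Two buffers: chunk k+1 is loaded while chunk k is processed.

    Only the first load and the last computation are not overlapped.
    """
    if n_chunks == 0:
        return 0
    return t_load + (n_chunks - 1) * max(t_load, t_compute) + t_compute

# Hypothetical timings in arbitrary units.
n_chunks, t_load, t_compute = 4, 3, 5
print(serial_time(n_chunks, t_load, t_compute),
      double_buffered_time(n_chunks, t_load, t_compute))
```

For many chunks the overlapped time approaches the number of chunks times the larger of the two costs, which is the behavior assumed by Eq. (3.2).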



^5 When distributed over 4096 Cell BEs, this corresponds to a global lattice size of $32^3 \times 64$.

^6 In multiple buffering schemes several buffers are used in an alternating fashion to either process or load/store data.
This requires additional storage (here in the LS) but allows for concurrent computation and data transfer.
Table 1: Comparison of the theoretical time estimates $T_i$ (in 1000 SPE cycles) for some micro-tasks arising
in the computation of Eq. (5.1) for different lattice data layouts: keeping data either in the on-chip LS (left
part) or in the off-chip MM (right part). The first rows indicate the corresponding number of neighbor sites
$A_{\rm int}$ and $A_{\rm ext}$. Estimated efficiencies, $\varepsilon_{\rm FP} = T_{\rm peak}/\max_i T_i$, are shown in the last row.



The predicted execution times for some of the micro-tasks considered in our model are given
in Table 1 for both data layouts and for reasonable choices of the local lattice size. If all data are
kept in the LS, the theoretical efficiency of about 27% is limited by the communication bandwidth
($T_{\rm exe} \approx T_{\rm ext}$). This is also the limiting factor for the smallest local lattice with data kept in MM,
while for larger local lattices the memory bandwidth becomes the limiting factor ($T_{\rm exe} \approx T_{\rm mem}$).
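
To make the communication-bound estimate for the LS layout more concrete, the following sketch counts boundary sites and converts them into a transfer time. Several ingredients are our assumptions and are not stated in the text: the three external directions have local extent 2, the 4-direction is entirely internal to the Cell BE, each boundary site sends a spin-projected half spinor of 12 real DP words, and the full 6 GB/s is available for the outgoing data of all 8 SPEs.

```python
CLOCK_GHZ = 3.2
N_SPE = 8
CYCLES_PER_SITE = 330                 # T_peak per site in DP, from the text
BETA_EXT = 6.0 / CLOCK_GHZ            # 6 GB/s expressed in bytes per cycle
HALF_SPINOR_BYTES = 12 * 8            # assumption: 12 real DP words per boundary site

def boundary_sites(local_extent, external_dims):
    """Count the sites on the faces of an SPE subvolume.

    Returns (A_int, A_ext): faces along `external_dims` border another
    Cell BE, all remaining faces border another SPE on the same chip.
    """
    volume = 1
    for extent in local_extent:
        volume *= extent
    a_int = a_ext = 0
    for mu, extent in enumerate(local_extent):
        faces = 2 * volume // extent   # forward plus backward face
        if mu in external_dims:
            a_ext += faces
        else:
            a_int += faces
    return a_int, a_ext

local_extent = (2, 2, 2, 8)            # V_SPE = 2^3 x 8
a_int, a_ext = boundary_sites(local_extent, external_dims={0, 1, 2})

v_spe = 2 * 2 * 2 * 8
t_peak = v_spe * CYCLES_PER_SITE
t_ext = N_SPE * a_ext * HALF_SPINOR_BYTES / BETA_EXT
efficiency = t_peak / t_ext

print(a_int, a_ext, round(t_ext), f"{efficiency:.1%}")
```

Under these assumptions the external transfer takes several times longer than the ideal compute time, and the resulting efficiency is close to the value of about 27% quoted above. The agreement should be read as a plausibility check of the assumptions rather than as a reconstruction of Table 1.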



We have performed hardware benchmarks with the same memory access pattern as (5.1), using
the above multiple buffering scheme for data from MM. We found that the execution times were at
most 20% higher than the theoretical predictions for $T_{\rm mem}$.

6. Performance model and benchmarks for DMA transfers 

DMA transfers determine $T_{\rm mem}$, $T_{\rm int}$, and $T_{\rm ext}$, and their optimization is crucial to exploit the
Cell BE performance. Our analysis of detailed micro-benchmarks, e.g., for LS-to-LS transfers,
shows that the linear model Eq. (3.1) does not accurately describe the execution time of DMA
operations with arbitrary size $I$ and address alignment. We refined our model to take into account
the fragmentation of data transfers, as well as the source and destination addresses, $A_s$ and $A_d$, of the
buffers:

$$T_{\rm DMA}(I, A_s, A_d) = \lambda^0 + \lambda^a \cdot N_a(I, A_s, A_d) + N_b(I, A_s) \cdot \frac{128\ {\rm bytes}}{\beta}. \qquad (6.1)$$

Each LS-to-LS DMA transfer has a latency of $\lambda^0 \approx 200$ cycles (from startup and wait for
completion). The DMA controllers fragment all transfers into $N_b$ 128-byte blocks aligned at LS lines (and
corresponding to single EIB transactions). When $\delta A = A_s - A_d$ is a multiple of 128, the source LS
lines can be directly mapped onto the destination LS lines. Then, we have $N_a = 0$, and the effective
bandwidth $\beta_{\rm eff} = I/(T_{\rm DMA} - \lambda^0)$ is approximately the peak value. Otherwise, if the alignments do
not match ($\delta A$ not a multiple of 128), an additional latency of $\lambda^a \approx 16$ cycles is introduced for each
transferred 128-byte block, reducing $\beta_{\rm eff}$ by about a factor of two. Fig. 3 illustrates how clearly
these effects are observed in our benchmarks and how accurately they are described by Eq. (6.1).

Figure 3: Execution time of LS-to-LS copy operations as a function of the transfer size. In the left panel
source and destination addresses are aligned, while in the right panel they are misaligned. Filled diamonds
show the measured values on an IBM QS20 system. Dashed and full lines correspond to the theoretical
prediction from Eq. (3.1) and Eq. (6.1), respectively.
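
The refined DMA model can be turned into a small calculator. The sketch below is our reading of Eq. (6.1), with two assumptions that go beyond the text: the peak bandwidth is taken as 8 bytes per cycle (25.6 GB/s at 3.2 GHz), and for misaligned buffers every transferred block pays the extra latency, i.e. the number of penalized blocks equals the number of blocks.

```python
LINE = 128          # bytes per LS line and per EIB transaction
LAMBDA_0 = 200      # startup and completion latency in cycles, from the text
LAMBDA_A = 16       # extra latency per misaligned block in cycles, from the text
BETA = 8.0          # assumed peak bandwidth in bytes per cycle

def n_blocks(size, a_src):
    """Number of 128-byte LS lines touched by the source range [a_src, a_src + size)."""
    if size <= 0:
        return 0
    first = a_src // LINE
    last = (a_src + size - 1) // LINE
    return last - first + 1

def t_dma(size, a_src, a_dst):
    """Execution time in cycles following the structure of Eq. (6.1)."""
    nb = n_blocks(size, a_src)
    aligned = (a_src - a_dst) % LINE == 0
    na = 0 if aligned else nb          # assumption: every block is penalized
    return LAMBDA_0 + LAMBDA_A * na + nb * LINE / BETA

def beta_eff(size, a_src, a_dst):
    """Effective bandwidth in bytes per cycle, excluding the startup latency."""
    return size / (t_dma(size, a_src, a_dst) - LAMBDA_0)

print(t_dma(2048, 0, 0), t_dma(2048, 0, 64))
print(beta_eff(2048, 0, 0), beta_eff(2048, 0, 64))
```

With these parameters the per-block penalty equals the per-block transfer time, so a misaligned copy reaches half the effective bandwidth of an aligned one, in line with the factor of two described above.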

7. Conclusion and outlook 

Our performance model and hardware benchmarks indicate that the Enhanced Cell BE is a 
promising option for lattice QCD. We expect that a sustained performance above 20% can be 
obtained on large machines. A refined theoretical analysis, e.g., taking into account latencies, and 
benchmarks with complete application codes are desirable to confirm our estimate. Strategies to 
optimize codes and data layout can be studied rather easily, but require some effort to implement. 

Since currently there is no suitable southbridge for the Cell BE to enable scalable parallel
computing, we plan to develop a network coprocessor that allows us to connect Cell BE nodes in a 3-d
torus with nearest-neighbor links. This network coprocessor should provide a bidirectional
bandwidth of 1 GB/s per link for a total bidirectional network bandwidth of 6 GB/s and perform remote
LS-to-LS copy operations with a latency of order 1 μs. Pending funding approval, this development
will be pursued in collaboration with the IBM Development Lab in Böblingen, Germany.